8 research outputs found

Computational pan-genomics: status, promises and challenges

Author: Abeel Thomas
Alkan Can
Baaijens Jasmijn
Bakker Paul
Boeva Valentina
Bonnal Raoul
Chiaromonte Francesca
Chikhi Rayan
Ciccarelli Francesca
Cijvat Robin
Datema Erwin
Dijkstra Louis
Duijn Cornelia
Dutilh Bas
Eichler Evan
El-Kebir Mohammed
Ernst Corinna
Eskin Eleazar
Garrison Erik
Ghaffaari Ali
Guryev Victor
Kersey Paul
Klau Gunnar
Kloosterman Wigard
Korbel Jan
Lameijer Eric-Wubbo
Langmead Benjamin
Marschall Tobias
Martin Marcel
Marz Manja
Medvedev Paul
Mu John
Mäkinen Veli
Neerincx Pieter
Novak Adam
Ouwens Klaasjan
Paten Benedict
Peterlongo Pierre
Pisanti Nadia
Porubsky David
Rahmann Sven
Raphael Benjamin
Reinert Knut
Ridder Dick
Ridder Jeroen
Rivals Eric
Sanders Ashley
Schlesner Matthias
Schulz-Trieglaff Ole
Schönhuth Alexander
Sheikhizadeh Siavash
Shneider Carl
Smit Sandra
The Computational Pan-Genomics Consortium
Valenzuela Daniel
Vandin Fabio
Wang Jiayin
Wessels Lodewyk
Ye Kai
Zhang Ying
Publication venue: 'Oxford University Press (OUP)'
Publication date: 01/01/2018
Field of study

Many disciplines, from human genetics and oncology to plant breeding, microbiology and virology, commonly face the challenge of analyzing rapidly increasing numbers of genomes. In the case of Homo sapiens, the number of sequenced genomes will approach hundreds of thousands in the next few years. Simply scaling up established bioinformatics pipelines will not be sufficient for leveraging the full potential of such rich genomic data sets. Instead, novel, qualitatively different computational methods and paradigms are needed. We will witness the rapid extension of computational pan-genomics, a new sub-area of research in computational biology. In this article, we generalize existing definitions and understand a pan-genome as any collection of genomic sequences to be analyzed jointly or to be used as a reference. We examine already available approaches to construct and use pan-genomes, discuss the potential benefits of future technologies and methodologies and review open challenges from the vantage point of the above-mentioned biological disciplines. As a prominent example for a computational paradigm shift, we particularly highlight the transition from the representation of reference genomes as strings to representations as graphs. We outline how this and other challenges from different application domains translate into common computational problems, point out relevant bioinformatics techniques and identify open problems in computer science. With this review, we aim to increase awareness that a joint approach to computational pan-genomics can help address many of the problems currently faced in various domains.

INRIA a CCSD electronic archive server

Archivio della Ricerca - Università di Pisa

EUR Research Repository

HAL-MINES ParisTech

Archivio della ricerca della Scuola Superiore Sant'Anna

Radboud Repository

HAL-Rennes 1

Computational pan-genomics: Status, promises and challenges

Author: Abeel T. (Thomas)
Alkan C. (Can)
Baaijens J.A. (Jasmijn)
Bakker P.I.W. (Paul) de
Boeva V. (Valentina)
Bonnal R.J.P. (Raoul)
Chiaromonte F. (Francesca)
Chikhi R. (Rayan)
Ciccarelli F.D. (Francesca)
Cijvat C.P. (Robin)
Datema E. (Erwin)
Dijkstra L.J. (Louis)
Duijn C.M. (Cornelia) van
Dutilh B.E. (Bas)
Eichler E.E. (Evan)
El-Kebir M. (Mohammed)
Ernst C. (Corinna)
Eskin E. (Eleazar)
Garrison E. (Erik)
Ghaffaari A. (Ali)
Guryev V. (Victor)
Kersey P. (Paul)
Klau G.W. (Gunnar)
Kloosterman W.P. (Wigard)
Korbel J.O. (Jan)
Lameijer E.-W. (Eric-Wubbo)
Langmead B. (Benjamin)
Marschall T. (Tobias)
Martin M. (Marcel)
Marz M. (Manja)
Medvedev P. (Paul)
Mu J.C. (John)
Mäkinen V. (Veli)
Neerincx P.B.T. (Pieter)
Novak A.M. (Adam)
Ouwens K. (Klaasjan)
Paten B. (Benedict)
Peterlongo P. (Pierre)
Pisanti N. (Nadia)
Porubsky D. (David)
Rahmann S. (Sven)
Raphael B.J. (Benjamin)
Reinert K. (Knut)
Ridder D. (Dick) de
Ridder J. (Jeroen) de
Rivals E. (Eric)
Sanders A.D. (Ashley)
Schlesner M. (Matthias)
Schulz-Trieglaff O. (Ole)
Schönhuth A. (Alexander)
Sheikhizadeh S. (Siavash)
Shneider C. (Carl)
Smit S. (Sandra)
The Computational Pan-Genomics Consortium
Valenzuela D. (Daniel)
Vandin F. (Fabio)
Wang J. (Jiayin)
Wessels L.F.A. (Lodewyk)
Ye K. (Kai)
Zhang Y. (Ying)
Publication venue: 'Oxford University Press (OUP)'
Publication date: 01/01/2018
Field of study

Many disciplines, from human genetics and oncology to plant breeding, microbiology and virology, commonly face the challenge of analyzing rapidly increasing numbers of genomes. In the case of Homo sapiens, the number of sequenced genomes will approach hundreds of thousands in the next few years. Simply scaling up established bioinformatics pipelines will not be sufficient for leveraging the full potential of such rich genomic data sets. Instead, novel, qualitatively different computational methods and paradigms are needed. We will witness the rapid extension of computational pan-genomics, a new sub-area of research in computational biology. In this article, we generalize existing definitions and understand a pan-genome as any collection of genomic sequences to be analyzed jointly or to be used as a reference. We examine already available approaches to construct and use pan-genomes, discuss the potential benefits of future technologies and methodologies and review open challenges from the vantage point of the above-mentioned biological disciplines. As a prominent example for a computational paradigm shift, we particularly highlight the transition from the representation of reference genomes as strings to representations

CWI's Institutional Repository

Erasmus University Digital Repository

Towards comparative pan-genomics

Author: Sheikhizadeh Anari Siavash
Publication venue: 'Wageningen University and Research'
Publication date: 01/01/2020
Field of study

Comparative genomics investigates the genomic makeup of species to unravel their unique variations and evolutionary relationships. High-throughput sequencing technologies have enabled reading the DNA content of a wide variety of species at an unprecedented rate. With the ongoing advances in these technologies, many species are or will soon be represented by a large number of genomes. Such genomes can be highly similar, but their differences in sequence and structure are of interest in many applications as they usually underlie specific traits. Having a wealth of genomes for a species, the current practice of basing comparative studies on a single reference genome is neither efficient nor effective. Traditional reference-based approaches make use of only a single reference genome, ignoring the potentially novel genomic content found in other individuals. As a result, over the last decade there has been a growing interest in developing pan-genome structures capable of capturing a wide genomic landscape of species. In this thesis, we develop a pan-genomic platform based on a novel representation of genomes with some functionalities for sequence retrieval, structural annotation, homology detection and read mapping.
Chapter 1 briefly introduces molecular biology and the revolution in genome sequencing. Then we introduce evolution and some basic concepts in genomics and comparative genomics which are necessary for the readers to be able to follow the chapters of this thesis. We emphasize the shortcomings of traditional reference-based approaches in comparative genomics and introduce pan-genomics as a solution which recently has received much attention. We introduce the essentials of a pan-genomic platform from the perspective of the Computational Pan-genomics Consortium, and classify existing pan-genomic data structures into two general categories of variation-aware and multi-genome data structures. Finally, we discuss the de Bruijn graph including the stranded version we introduce in Chapter 2.
Chapter 2 highlights the necessity of a transition from reference-centric to pan-genomic approaches. As a comprehensive representation of a large number of genomes, we introduce a generalized de Bruijn graph. We present a novel algorithm to construct such a DBG and take advantage of the Neo4j graph database for consistent and scalable storage of the graph. We develop a toolset, called PanTools, which provides some useful functionalities e.g. for annotation, graph update and sequence retrieval. We demonstrate the performance of PanTools on large datasets of bacterial, fungal and plant genomes. We illustrate how sequence variation creates specific sub-structures in the pan-genome including an example of the variability of a famous gene, called FRIGIDA, among 19 A. thaliana accessions.
Chapter 3 emphasizes the need for highly efficient tools to detect homology in the ever-increasing genomic data. We present an efficient method for detecting homology across a large number of individuals at various evolutionary distances. The presented k-mer based approach considerably reduces the number of alignments between pairs of peptide sequences without sacrificing sensitivity. We demonstrate accuracy, scalability, efficiency and applicability of the presented method in large proteomes of bacteria, fungi, plants and Metazoa. The detected homology groups are stored in the pan-genome graph database, and can be queried, for example, for their size, copy number and conservation rate.
Chapter 4 focuses on correcting errors in next-generation sequencing reads which can improve the performance of assembly and increase the accuracy and sensitivity of quantitative analyses such as differential expression analyses and variant calling. We develop a tool, called ACE, based on a k-mer trie data structure to correct for substitution errors in short read data. We show that ACE yields higher gains in terms of coverage depth, outperforming state-of-the-art competitors in the majority of cases, on both MiSeq and HiSeq Illumina data.
Chapter 5 presents a multi-genome read mapping approach which utilizes the index and pan-genome structure, introduced in Chapter 2, to map short reads to a large number of genomes simultaneously. One advantage is the efficiency as the joint index enables anchoring the reads to all the genomes at once avoiding repetitive alignments when the genomes are highly similar. Another advantage is that we can resolve the reference bias by including regions that are entirely missing in the reference but present in some other accessions. Moreover, such a multi-genome read mapper can be utilized in binning and abundance estimation of metagenomic samples. In this chapter, we successfully apply this approach to map genomic and metagenomic reads to large collections of viral, archaeal, bacterial, fungal and plant genomes.
Chapter 6 puts forward some ideas on the future challenges and opportunities in the field of pan-genomics. We discuss the emerging shift from reference-centric to pan-genomic approaches and the necessity of substantial adjustments and redevelopments of traditional methods and applications such as genome annotation, structural variation detection and real-time pan-genome visualization. We conclude that the design and engineering introduced in this thesis contribute to the field and the growing number of similar efforts indicates a bright future ahead for comparative pan-genomics.

ACE: accurate correction of errors using K

Author: Dick de Ridder
Molnar
Schulz
Siavash Sheikhizadeh
Publication venue: 'Oxford University Press (OUP)'
Publication date
Field of study

Crossref

Efficient inference of homologs in large eukaryotic pan-proteomes

Author: Ridder Dick, de
Schranz M.E.
Sheikhizadeh Anari Siavash
Smit Sandra
Publication venue
Publication date: 01/01/2018
Field of study

BACKGROUND: Identification of homologous genes is fundamental to comparative genomics, functional genomics and phylogenomics. Extensive public homology databases are of great value for investigating homology but need to be continually updated to incorporate new sequences. As new sequences are rapidly being generated, there is a need for efficient standalone tools to detect homologs in novel data. RESULTS: To address this, we present a fast method for detecting homology groups across a large number of individuals and/or species. We adopted a k-mer based approach which considerably reduces the number of pairwise protein alignments without sacrificing sensitivity. We demonstrate accuracy, scalability, efficiency and applicability of the presented method for detecting homology in large proteomes of bacteria, fungi, plants and Metazoa. CONCLUSIONS: We clearly observed the trade-off between recall and precision in our homology inference. Favoring recall or precision strongly depends on the application. The clustering behavior of our program can be optimized for particular applications by altering a few key parameters. The program is available for public use at https://github.com/sheikhizadeh/pantools as an extension to our pan-genomic analysis tool, PanTools.

Directory of Open Access Journals

Wageningen University & Research Publications

FigShare

PanTools: representation, storage and exploration of pan-genomic data

Author: Baier
Deorowicz
Dick de Ridder
M. Eric Schranz
Mehmet Akdel
Sandra Smit
Siavash Sheikhizadeh
Strope
Publication venue: 'Oxford University Press (OUP)'
Publication date
Field of study

Crossref

Efficient inference of homologs in large eukaryotic pan-proteomes

Author: A. J. Enright
AC Roth
AM Altenhoff
BD Ondov
CA Opitz
Dick de Ridder
DM Emms
EM Zdobnov
EV Koonin
F Chen
F Tekaia
H Li
J Huerta-Cepas
J Muller
J Ruan
J Zhu
K Trachana
L Li
M Remm
M. Eric Schranz
PK Strope
RL Tatusov
RM Waterhouse
S Cheng
S Powell
S Sheikhizadeh
Sandra Smit
Siavash Sheikhizadeh Anari
T Marschall
TH Lee
W Ding
X Gan
Publication venue: 'Springer Science and Business Media LLC'
Publication date
Field of study

Crossref